Hw-02-02

Author

Ajinkya Deshmukh

Imputing like a Data Scientist

Exploring, visualizing, and imputing outliers and missing values (NAs) in a novel data set, and producing quality graphs and tables

Learning from this section

  1. Loading and exploring a data set with quality tables

  2. Diagnosing outliers and missing values

  3. Imputing outliers and missing values

Required Setup

# Sets the number of significant figures to 2
options(digits = 2)

# Required package for quick package downloading and loading 
if (!require(pacman))
  install.packages("pacman")
Loading required package: pacman
pacman::p_load(cluster, # K cluster analyses
               dlookr, # Exploratory data analysis
               formattable, # HTML tables from R outputs
               ggfortify, # Plotting tools for stats
               ggpubr, # Publishable ggplots
               here, # Standardizes paths to data
               kableExtra, # Alternative to formattable
               knitr, # Needed to write HTML reports
               missRanger, # To generate NAs
               plotly, # Visualization package
               rattle, # Decision tree visualization
               rpart, # rpart algorithm
               tidyverse, # Powerful data wrangling package suite
               visdat) # Another EDA visualization package

# Set global ggplot() theme
# Theme theme_pubclean() from the ggpubr package with base text size = 16
theme_set(theme_pubclean(base_size = 16)) 
# All axes titles to their respective far right sides
theme_update(axis.title = element_text(hjust = 1))
# Remove axes ticks
theme_update(axis.ticks = element_blank()) 
# Remove legend key
theme_update(legend.key = element_blank())

Loading and Examining a Data set (“artists.csv”)

# Load the artists data set
dataset <- read.csv("artists.csv")
dataset |>
  glimpse()
Rows: 3,162
Columns: 14
$ artist_name                <chr> "Aaron Douglas", "Aaron Douglas", "Aaron Do…
$ edition_number             <dbl> 9, 10, 11, 99, 13, 14, 15, 16, 14, 15, 16, …
$ year                       <int> 1991, 1996, 2001, 2005, 2009, 2013, 2016, 2…
$ artist_nationality         <chr> "American", "American", "American", "Americ…
$ artist_nationality_other   <chr> "American", "American", "American", "Americ…
$ artist_gender              <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
$ artist_race                <chr> "Black or African American", "Black or Afri…
$ artist_ethnicity           <chr> "Not Hispanic or Latino origin", "Not Hispa…
$ book                       <chr> "Gardner", "Gardner", "Gardner", "Gardner",…
$ space_ratio_per_page_total <dbl> 0.35, 0.37, 0.30, 0.38, 0.40, 0.45, 0.48, 0…
$ artist_unique_id           <int> 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 6, 6, 6…
$ moma_count_to_year         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ whitney_count_to_year      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ artist_race_nwi            <chr> "Non-White", "Non-White", "Non-White", "Non…
# What does the data look like?
dataset |>
  head() |>
  formattable()
artist_name edition_number year artist_nationality artist_nationality_other artist_gender artist_race artist_ethnicity book space_ratio_per_page_total artist_unique_id moma_count_to_year whitney_count_to_year artist_race_nwi
Aaron Douglas 9 1991 American American Male Black or African American Not Hispanic or Latino origin Gardner 0.35 2 0 0 Non-White
Aaron Douglas 10 1996 American American Male Black or African American Not Hispanic or Latino origin Gardner 0.37 2 0 0 Non-White
Aaron Douglas 11 2001 American American Male Black or African American Not Hispanic or Latino origin Gardner 0.30 2 0 0 Non-White
Aaron Douglas 99 2005 American American Male Black or African American Not Hispanic or Latino origin Gardner 0.38 2 0 0 Non-White
Aaron Douglas 13 2009 American American Male Black or African American Not Hispanic or Latino origin Gardner 0.40 2 0 0 Non-White
Aaron Douglas 14 2013 American American Male Black or African American Not Hispanic or Latino origin Gardner 0.45 2 0 0 Non-White
  • The attributes of the data set are displayed in two formats:

  • data.frame (via glimpse()): lists each attribute with its data type.

  • formattable widget: displays the attributes and the data associated with them.

Diagnosing the Data set

# What are the properties of the data
dataset |>
  diagnose() |>
  formattable()
variables types missing_count missing_percent unique_count unique_rate
artist_name character 0 0.0 413 0.13061
edition_number numeric 86 2.7 18 0.00569
year integer 0 0.0 22 0.00696
artist_nationality character 0 0.0 52 0.01645
artist_nationality_other character 0 0.0 6 0.00190
artist_gender character 0 0.0 3 0.00095
artist_race character 0 0.0 6 0.00190
artist_ethnicity character 58 1.8 3 0.00095
book character 0 0.0 2 0.00063
space_ratio_per_page_total numeric 0 0.0 3065 0.96932
artist_unique_id integer 0 0.0 413 0.13061
moma_count_to_year integer 0 0.0 54 0.01708
whitney_count_to_year integer 0 0.0 37 0.01170
artist_race_nwi character 0 0.0 2 0.00063
  • Displaying the properties of each attribute: its data type, missing count and percentage, and unique count and rate.
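Under the hood, a diagnosis like this is just per-column metadata. A minimal base-R sketch of the same idea (df here is a made-up toy frame, not the artists data):

```r
# Base-R sketch of what diagnose() reports for each column:
# type, missing count, and unique count (toy data for illustration)
df <- data.frame(name = c("a", "b", "b"), score = c(1, NA, 3))

diag <- data.frame(
  variables     = names(df),
  types         = vapply(df, function(x) class(x)[1], character(1)),
  missing_count = vapply(df, function(x) sum(is.na(x)), integer(1)),
  unique_count  = vapply(df, function(x) length(unique(x)), integer(1))
)
diag
```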

Diagnosing Outliers

# Table showing outliers
dataset |>
  diagnose_outlier() |>
  filter(outliers_ratio > 0) |>  
  mutate(rate = outliers_mean / with_mean) |>
  arrange(desc(rate)) |> 
  select(-outliers_cnt) |>
  formattable()
variables outliers_ratio outliers_mean with_mean without_mean rate
edition_number 2.5 99.0 10.38 8.01 9.54
moma_count_to_year 8.5 25.6 4.31 2.33 5.94
whitney_count_to_year 20.6 9.5 1.96 0.00 4.85
space_ratio_per_page_total 8.6 1.5 0.53 0.43 2.92
year 4.1 1939.9 1994.24 1996.58 0.97
  • Displaying outlier statistics for the numerical attributes, sorted by the ratio of the outlier mean to the overall mean.
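Boxplot-style outlier diagnostics are typically based on the 1.5 × IQR rule; a base-R sketch of that rule (an illustration of the idea, not dlookr's exact implementation):

```r
# Flag values outside the 1.5 * IQR "fences" (the boxplot rule);
# a sketch for illustration, not dlookr's exact implementation
find_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  which(x < q[1] - fence | x > q[2] + fence)
}

find_iqr_outliers(c(5, 6, 7, 8, 9, 99))  # flags the 99
```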
# Boxplots and histograms of data with and without outliers
dataset |>
  select(find_outliers(dataset)) |>
  plot_outlier()

  • Plotting boxplots and histograms for the numerical attributes that contain outliers.

Exploration of Missing Values (NAs)

# Randomly generate NAs for 30% of the observations
na.dataset <- dataset |>
  generateNA(p = 0.3)

# First six rows
na.dataset |>
  head() |>
  formattable()
artist_name edition_number year artist_nationality artist_nationality_other artist_gender artist_race artist_ethnicity book space_ratio_per_page_total artist_unique_id moma_count_to_year whitney_count_to_year artist_race_nwi
NA 9 1991 NA American Male Black or African American Not Hispanic or Latino origin Gardner 0.35 2 NA 0 Non-White
Aaron Douglas NA 1996 American American Male Black or African American Not Hispanic or Latino origin NA NA 2 NA 0 Non-White
Aaron Douglas 11 2001 American American NA NA NA Gardner 0.30 NA 0 0 Non-White
Aaron Douglas 99 2005 NA NA NA Black or African American NA Gardner 0.38 NA 0 0 Non-White
NA NA NA American American Male Black or African American NA Gardner NA NA 0 0 NA
Aaron Douglas NA 2013 American American NA NA Not Hispanic or Latino origin NA 0.45 2 0 NA Non-White
  • Displaying the data set after NAs were added to the attributes.
# Create the NA table
na.dataset |>
  plot_na_pareto(only_na = TRUE, plot = FALSE) |>
  formattable() # Publishable table
variable frequencies ratio grade cumulative
edition_number 1006 0.32 Bad 7.5
artist_ethnicity 990 0.31 Bad 14.9
artist_gender 949 0.30 Bad 22.0
artist_name 949 0.30 Bad 29.1
artist_nationality 949 0.30 Bad 36.2
artist_nationality_other 949 0.30 Bad 43.3
artist_race 949 0.30 Bad 50.4
artist_race_nwi 949 0.30 Bad 57.5
artist_unique_id 949 0.30 Bad 64.5
book 949 0.30 Bad 71.6
moma_count_to_year 949 0.30 Bad 78.7
space_ratio_per_page_total 949 0.30 Bad 85.8
whitney_count_to_year 949 0.30 Bad 92.9
year 949 0.30 Bad 100.0
  • Creating the NA frequency table for the data set.
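The frequency column of that Pareto table is, at its core, a per-column NA count. A base-R equivalent (toy data for illustration):

```r
# Count the missing values per column and sort in decreasing order,
# the same quantity as the "frequencies" column above (toy data)
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
na_counts <- sort(colSums(is.na(df)), decreasing = TRUE)
na_counts
```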

Plot showing the frequency of missing values

# Plot the intersect of the columns with missing values
# This plot visualizes the table above
na.dataset |>
  plot_na_pareto(only_na = TRUE)

  • Plotting the intersect of the columns with missing values in the dataset.

Advanced Exploration of Missing Values (NAs)

  1. The intersect plot shows, for every relevant combination of columns, how many missing values they have in common.

  2. Orange boxes mark the columns in question.

  3. The x axis shows the number of missing values in that column.

  4. The y axis shows the number of missing values shared with the columns in the orange blocks.

# Plot the intersect of the 3 columns with the most missing values
# This means that some combinations of columns have missing values in the same row
na.dataset |>
  select(year, artist_unique_id, whitney_count_to_year) |>
  plot_na_intersect(only_na = TRUE) 

  • Plotting the intersect of the three columns with the most missing values.

Determining if NA Observations are the Same

  1. Missing values can belong to the same observation across several columns.

  2. The visdat package addresses this with the vis_miss() function, which shows the rows with missing values, rendered interactively through ggplotly().

  3. Below we plot a subset of the columns with NAs, and we can zoom into individual rows.

# Interactive plotly() plot of all NA values to examine every row
na.dataset |>
  select(year, artist_race, artist_gender) |>
  vis_miss() |>
  ggplotly()
  • Plotting all the NA values to examine every row of the dataset
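The question vis_miss() answers visually, whether NAs co-occur on the same rows, can also be checked directly in base R (toy data):

```r
# Which rows are missing in BOTH columns at once? (toy data)
df <- data.frame(a = c(NA, 2, NA), b = c(NA, 5, 6))
both_missing <- which(is.na(df$a) & is.na(df$b))
both_missing  # only row 1 is missing in both columns
```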

Impute Outliers and NAs

  • The principal goal of all imputation is to find a method that does not change the distribution of the data too much.

  • There are several methods to impute outliers and NAs; some of them are discussed below with the benefits and costs of each.

# Box plot
dataset %>% # The artists data set
  ggplot(aes(x = year, y = artist_gender, fill = artist_gender)) + # Create a ggplot
  geom_boxplot(width = 0.5, outlier.size = 2, outlier.alpha = 0.5) +
  xlab("Year") +  # Relabel the x axis
  ylab("Artist Gender") + # Relabel the y axis
  theme(legend.position = "top")

  • Box plot of year by artist gender, with outliers shown as points.
  1. To impute outliers we use imputate_outlier(), which replaces them with values estimated from the existing data:
  • mean: arithmetic mean

  • median: median

  • mode: mode

  • capping: impute the upper outliers with the 95th percentile, and the lower outliers with the 5th percentile.
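The capping method can be sketched in a few lines of base R (an illustration of the idea, not dlookr's exact code):

```r
# "Capping": clamp values below the 5th percentile up to it and
# values above the 95th percentile down to it (illustration only)
cap_outliers <- function(x) {
  lo <- quantile(x, 0.05, na.rm = TRUE)
  hi <- quantile(x, 0.95, na.rm = TRUE)
  pmin(pmax(x, lo), hi)
}

cap_outliers(c(1, 10, 11, 12, 13, 100))  # extremes pulled toward the middle
```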

Mean Imputation

Outliers in the selected variable are imputed with the mean

# Raw summary, output suppressed
mean_out_imp_edition_number <- na.dataset |>
  select(edition_number) |>
  imputate_outlier(edition_number, method = "mean")

# Output showing the summary statistics of our imputation
mean_out_imp_edition_number |>
  summary() 
Impute outliers with mean

* Information of Imputation (before vs after)
                    Original Imputation
described_variables "value"  "value"   
n                   "2156"   "2156"    
na                  "1006"   "1006"    
mean                "11"     " 8"      
sd                  "15.6"   " 4.4"    
se_mean             "0.336"  "0.094"   
IQR                 "8"      "6"       
skewness            "5.05"   "0.28"    
kurtosis            "25.93"  "-0.99"   
p00                 "1"      "1"       
p01                 "1"      "1"       
p05                 "1.1"    "1.1"     
p10                 "2.5"    "2.5"     
p20                 "4"      "4"       
p25                 "5"      "5"       
p30                 "5"      "5"       
p40                 "6"      "6"       
p50                 "7"      "7"       
p60                 "9"      "9"       
p70                 "11"     "11"      
p75                 "13"     "11"      
p80                 "14"     "13"      
p90                 "15"     "15"      
p95                 "16"     "16"      
p99                 "99"     "16"      
p100                "99"     "16"      
# Visualization of the mean imputation
mean_out_imp_edition_number |>
  plot()

  • Displaying the distribution of the variable before and after mean imputation.
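The same operation in miniature base R (a sketch of the idea, not dlookr's exact implementation): flag outliers with the 1.5 × IQR rule, then replace them with the mean of the remaining values.

```r
# Replace IQR-rule outliers with the mean of the non-outlier values
# (a sketch of mean imputation; assumes x has no NAs)
impute_outliers_mean <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  out <- x < q[1] - fence | x > q[2] + fence
  x[out] <- mean(x[!out])
  x
}

impute_outliers_mean(c(5, 6, 7, 8, 9, 99))  # the 99 becomes 7
```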

Median Imputation

Outliers in the selected variable are imputed with the median

# Raw summary, output suppressed
med_out_imp_edition_number <- dataset |>
  select(edition_number) |>
  imputate_outlier(edition_number, method = "median")

# Output showing the summary statistics of our imputation
med_out_imp_edition_number |>
  summary()
Impute outliers with median

* Information of Imputation (before vs after)
                    Original Imputation
described_variables "value"  "value"   
n                   "3076"   "3076"    
na                  "86"     "86"      
mean                "10"     " 8"      
sd                  "15.1"   " 4.4"    
se_mean             "0.273"  "0.079"   
IQR                 "8"      "6"       
skewness            "5.20"   "0.34"    
kurtosis            "27.71"  "-0.95"   
p00                 "1"      "1"       
p01                 "1"      "1"       
p05                 "1.1"    "1.1"     
p10                 "3"      "3"       
p20                 "4"      "4"       
p25                 "5"      "5"       
p30                 "5"      "5"       
p40                 "6"      "6"       
p50                 "7"      "7"       
p60                 "9"      "8"       
p70                 "11"     "10"      
p75                 "13"     "11"      
p80                 "14"     "13"      
p90                 "15"     "15"      
p95                 "16"     "16"      
p99                 "99"     "16"      
p100                "99"     "16"      
# Visualization of the median imputation
med_out_imp_edition_number |>
  plot()

  • Displaying the distribution of the variable before and after median imputation.

Mode Imputation

Outliers in the selected variable are imputed with the mode

# Raw summary, output suppressed
mode_out_imp_edition_number <- dataset |>
  select(edition_number) |>
  imputate_outlier(edition_number, method = "mode")

# Output showing the summary statistics of our imputation
mode_out_imp_edition_number |>
  summary()
Impute outliers with mode

* Information of Imputation (before vs after)
                    Original Imputation
described_variables "value"  "value"   
n                   "3076"   "3076"    
na                  "86"     "86"      
mean                "10"     " 8"      
sd                  "15.1"   " 4.4"    
se_mean             "0.273"  "0.079"   
IQR                 "8"      "6"       
skewness            "5.20"   "0.34"    
kurtosis            "27.71"  "-0.95"   
p00                 "1"      "1"       
p01                 "1"      "1"       
p05                 "1.1"    "1.1"     
p10                 "3"      "3"       
p20                 "4"      "4"       
p25                 "5"      "5"       
p30                 "5"      "5"       
p40                 "6"      "6"       
p50                 "7"      "7"       
p60                 "9"      "8"       
p70                 "11"     "10"      
p75                 "13"     "11"      
p80                 "14"     "13"      
p90                 "15"     "15"      
p95                 "16"     "16"      
p99                 "99"     "16"      
p100                "99"     "16"      
# Visualization of the mode imputation
mode_out_imp_edition_number |>
  plot()

  • Displaying the distribution of the variable before and after mode imputation.
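Base R has no built-in statistical mode function; a common sketch used for mode imputation:

```r
# Most frequent value (statistical mode); ties go to the value seen first
stat_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

stat_mode(c(5, 7, 7, 8, 99))  # 7
```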

K-Nearest Neighbor (KNN) Imputation

# Check for missing values
any(is.na(dataset))
[1] TRUE
# Check for infinite values
any(is.infinite(dataset$edition_number))
[1] FALSE
# Remove rows with missing values (na.omit deletes them rather than imputing)
dataset <- na.omit(dataset)
# Cluster (clara) plot of the numeric columns of our data set
autoplot(clara(dataset |> select_if(is.numeric), 3))

  • Plotting the clusters underlying the KNN imputation of the data set.
library(magrittr)

Attaching package: 'magrittr'
The following object is masked from 'package:purrr':

    set_names
The following object is masked from 'package:tidyr':

    extract
The following object is masked from 'package:dlookr':

    extract
numeric_only <- dataset %>%
  select_if(is.numeric) # keep only the numeric columns
# Raw summary, output suppressed
knn_na_imp_space <- numeric_only %>%
  imputate_na(edition_number, method = "knn")

# Plot showing the results of our imputation
knn_na_imp_space %>%
  plot()

  • Plotting the results of the KNN imputation after removing the NA and infinite values.
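The KNN idea in miniature base R (a sketch, not dlookr's implementation): fill a missing cell with the mean of that column among the k rows closest in the remaining columns.

```r
# Impute one missing cell from the k nearest rows, measured by
# Euclidean distance on the other columns (toy sketch)
knn_impute_one <- function(df, row, col, k = 2) {
  others <- setdiff(seq_len(nrow(df)), row)
  target <- unlist(df[row, -col])
  d <- apply(df[others, -col, drop = FALSE], 1,
             function(r) sqrt(sum((r - target)^2)))
  nearest <- others[order(d)][seq_len(k)]
  mean(df[nearest, col])
}

toy <- data.frame(x = c(1, 2, 3, 10), y = c(10, 20, NA, 100))
knn_impute_one(toy, row = 3, col = 2, k = 2)  # mean of y over the 2 nearest rows
```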

Recursive Partitioning and Regression Trees (rpart)

rpart is a decision tree machine learning algorithm that builds classification or regression models through a two-stage process, producing models that can be represented as binary trees. The algorithm recursively splits the data into subsets, which move down separate branches of the tree until a termination criterion is reached.
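The fitted model can then predict the missing entries from the observed columns. The same fit-then-predict pattern, sketched with lm() standing in for the tree (toy data):

```r
# Model-based imputation pattern: fit on the complete rows, then
# predict the missing entries (lm() stands in for the rpart tree)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, 4, NA, 8, 10))
fit <- lm(y ~ x, data = toy[!is.na(toy$y), ])
toy$y[is.na(toy$y)] <- predict(fit, newdata = toy[is.na(toy$y), ])
toy$y  # the NA is filled with the model's prediction
```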

library(magrittr)
numeric_only <- na.dataset %>%
  select_if(is.numeric) # keep only the numeric columns
# Raw summary, output suppressed
rpart_na_imp_space <- numeric_only %>%
  imputate_na(edition_number, method = "rpart")

# Plot showing the results of our imputation
rpart_na_imp_space %>%
  plot()

  • Plotting the results of the rpart imputation, which recursively splits the data set into subsets.

Multivariate Imputation by Chained Equations (MICE)

MICE is an algorithm that fills in missing values multiple times, and therefore handles the uncertainty of imputation better than single-pass methods. The approach creates multiple completed copies of the data, which are analyzed separately and then pooled into a single result.
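That impute-analyze-pool cycle can be shown in miniature (a sketch of the idea, not the mice package):

```r
# The MICE idea in miniature: impute several times with random noise,
# analyze each completed copy, then pool the per-copy estimates
set.seed(123)
y <- c(4, 6, NA, 8)
m <- 5
estimates <- replicate(m, {
  filled <- y
  # simple stochastic imputation: observed mean plus noise
  filled[is.na(filled)] <- mean(y, na.rm = TRUE) + rnorm(1)
  mean(filled)  # the per-copy "analysis"
})
pooled <- mean(estimates)  # pooled estimate across the m completed copies
```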

library(magrittr)
numeric_only <- na.dataset %>%
  select_if(is.numeric) # keep only the numeric columns
# Raw summary, output suppressed
mice_na_imp_edition_number <- numeric_only %>%
  imputate_na(edition_number, method = "mice", seed = 123)

 iter imp variable
  1   1  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  1   2  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  1   3  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  1   4  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  1   5  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  2   1  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  2   2  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  2   3  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  2   4  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  2   5  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  3   1  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  3   2  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  3   3  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  3   4  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  3   5  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  4   1  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  4   2  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  4   3  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  4   4  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  4   5  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  5   1  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  5   2  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  5   3  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  5   4  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
  5   5  edition_number  year  space_ratio_per_page_total  artist_unique_id  moma_count_to_year  whitney_count_to_year
# Plot showing the results of our imputation
mice_na_imp_edition_number |>
  plot()

  • Plotting the distribution of the variable after MICE imputation.